Skip to content

Conversation

@dmehala
Copy link
Contributor

@dmehala dmehala commented Apr 16, 2025

This change comes after a deep investigation into why certain jobs were failing when upgrading to the latest dd-trace-cpp version. The root cause turned out to be increased contention in the default curl HTTP client due to telemetry refactoring. This caused NGINX to take significantly longer to shut down, which in turn led to timeouts during reload_nginx.

While investigating, I also realized the logic in reload_nginx was flawed. It was only checking whether a worker process was running, rather than verifying that the old worker (by PID) had exited. Since NGINX spawns a new worker and sends a SIGQUIT to the old one during reloads, our previous check was resulting in false positives. I've updated the logic to explicitly wait for the old worker PID to terminate and confirm the new worker is up before proceeding.

Additionally, while fixing this, I encountered issues with some AppSec tests. These tests were intentionally causing the worker to fail during initialization (e.g. by using invalid config paths), but this left NGINX in a broken state for subsequent tests. To resolve this, the tests for
datadog_appsec_http_blocked_template_json,
datadog_appsec_http_blocked_template_html, and
datadog_appsec_ruleset_file now validate the existence of their required files during the configuration process instead of at worker startup.

@dmehala dmehala requested review from a team as code owners April 16, 2025 09:21
@dmehala dmehala requested review from cataphract and removed request for a team April 16, 2025 09:21
Comment on lines +34 to +35
if span.get("meta", {}).get("_dd.appsec.json"):
return json.loads(span["meta"]["_dd.appsec.json"])

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Code Quality Violation

too many nesting levels (...read more)

Avoid to nest too many loops together. Having too many loops make your code harder to understand.
Prefer to organize your code in functions and unit of code you can clearly understand.

Learn More

View in Datadog  Leave us feedback  Documentation

@dmehala dmehala force-pushed the dmehala/rework-tests branch from b08e007 to 9fe4f15 Compare April 16, 2025 09:25
@dmehala dmehala marked this pull request as draft April 16, 2025 09:51
if any(line for line in rest):
raise Exception("Unexpected trailing output to curljson.sh: " +
json.dumps(rest))
raise Exception(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Quality Violation

Exception is too generic (...read more)

Do not raise Exception and BaseException. These are too generic. Having generic exceptions makes it difficult to differentiate errors in a program. Use a specific exception, for example, ValueError, or create your own instead of using generic ones.

Learn More

View in Datadog  Leave us feedback  Documentation

if status != 200:
raise Exception(
f"nginx returned error (status, body): {(status, body)}")
raise Exception(f"nginx returned error (status, body): {(status, body)}")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Quality Violation

Exception is too generic (...read more)

Do not raise Exception and BaseException. These are too generic. Having generic exceptions makes it difficult to differentiate errors in a program. Use a specific exception, for example, ValueError, or create your own instead of using generic ones.

Learn More

View in Datadog  Leave us feedback  Documentation

@dmehala dmehala force-pushed the dmehala/rework-tests branch from 210c0e6 to bf39393 Compare April 17, 2025 11:17
This change comes after a deep investigation into why certain jobs were
failing when upgrading to the latest dd-trace-cpp version.

The root cause turned out to be increased contention in the default curl HTTP
client due to telemetry refactoring. This caused NGINX to take significantly
longer to shut down, which in turn led to timeouts during reload_nginx.

While investigating, I also realized the logic in `reload_nginx` was flawed.
It was only checking whether a worker process was running, rather than verifying
that the old worker (by PID) had exited. Since NGINX spawns a new worker and
sends a SIGQUIT to the old one during reloads, our previous check was resulting
in false positives. I've updated the logic to explicitly wait for the old worker
PID to terminate and confirm the new worker is up before proceeding.

Additionally, while fixing this, I encountered issues with some AppSec tests.
These tests were intentionally causing the worker to fail during initialization
(e.g. by using invalid config paths), but this left NGINX in a broken state for
subsequent tests. To resolve this, the tests for
`datadog_appsec_http_blocked_template_json`,
`datadog_appsec_http_blocked_template_html`, and
`datadog_appsec_ruleset_file` now validate the existence of their required files
during the configuration process instead of at worker startup.
@dmehala dmehala force-pushed the dmehala/rework-tests branch from bf39393 to 50fec9b Compare April 17, 2025 11:22
@codecov-commenter
Copy link

codecov-commenter commented Apr 17, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.20%. Comparing base (4e94d09) to head (1ce8ad9).
Report is 5 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #197      +/-   ##
==========================================
- Coverage   70.26%   70.20%   -0.06%     
==========================================
  Files          47       48       +1     
  Lines        6224     6233       +9     
  Branches      882      883       +1     
==========================================
+ Hits         4373     4376       +3     
- Misses       1423     1429       +6     
  Partials      428      428              
Files with missing lines Coverage Δ
src/common/directives.cpp 100.00% <100.00%> (ø)

... and 2 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@dmehala dmehala marked this pull request as ready for review April 17, 2025 12:21
f"{timeout_seconds} seconds timeout exceeded while waiting for nginx workers to stop. {now - before} seconds elapsed."
)
time.sleep(poll_period_seconds)
def old_worker_stops(worker_pid):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

old_worker_pids is a set, and then treated as _worker_pid the code is doing set not in another_set which is always true, or am I getting something wrong?

The old code was looping while the intersection of the old pids and the current pids was non-empty, continuing once empty, which seems to be correct when expecting pids to stop, but was not really checking for new processes to spawn 🤔

Copy link
Contributor Author

@dmehala dmehala Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting! I know David for writing high quality code, I knew I missed something, thank you for taking the time to investigate and understand the prior code.

the code is doing set not in another_set which is always true, or am I getting something wrong?

I intended element not in another set but as you mentioned old_worker_pids is a set. It's working because wait_until stop waiting when the predicate is True which is always the case here. 😅

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in 1ce8ad9

Copy link
Collaborator

@pablomartinezbernardo pablomartinezbernardo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dmehala dmehala merged commit 45ce45e into master Apr 26, 2025
49 checks passed
@dmehala dmehala deleted the dmehala/rework-tests branch April 26, 2025 08:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants